I have obtained publicly available data from Capital Bikeshare. Using this data, I have created basic metrics, visualizations, and time series forecasting models. I first explore the data with graphical techniques to find trends and patterns, then use this knowledge to build forecasting models that predict future daily and monthly demand. I also identify popular locations and routes, and analyze how Capital Bikeshare has managed its fleet and bike stations over time.
The data I used for this project are publicly available on Capital Bikeshare’s website (https://www.capitalbikeshare.com/system-data). That data ranges from when the bikeshare launched in late 2010 up until the end of 2018. There is an observation for each ride taken during this time period. Each observation has information about start and end time, start and end location, duration, bike number, and member type. The dataset is quite large with more than 22 million observations. Let’s take a quick look at the first few observations in the dataset.
| Duration | Start.Date | End.Date | Start.Station | End.Station | Bike.Number | Member.Type |
|---|---|---|---|---|---|---|
| 1012 | 2010-09-20 11:27:04 | 2010-09-20 11:43:56 | M St & New Jersey Ave SE | 4th & M St SW | W00742 | Member |
| 61 | 2010-09-20 11:41:22 | 2010-09-20 11:42:23 | 1st & N St SE | 1st & N St SE | W00032 | Member |
| 2690 | 2010-09-20 12:05:37 | 2010-09-20 12:50:27 | 5th & K St NW | 19th St & Pennsylvania Ave NW | W00993 | Member |
| 1406 | 2010-09-20 12:06:05 | 2010-09-20 12:29:32 | 5th & K St NW | Park Rd & Holmead Pl NW | W00344 | Member |
| 1413 | 2010-09-20 12:10:43 | 2010-09-20 12:34:17 | 19th St & Pennsylvania Ave NW | 15th & P St NW | W00883 | Member |
We now have an idea of what kind of data we have. Later in this report, we will do some exploratory data analysis for the other variables, but for now we will focus on the date variables.
We can use the date variables to obtain hourly, daily, and monthly time series of demand. We will then use forecasting models to predict demand in the future. Forecasting the monthly demand can tell us how much we expect the business to grow and can help guide long-term strategy decisions (e.g., growing the fleet, adding new stations, or expanding to new areas). Hourly and daily forecasts give us a look at the short term and can inform us how to efficiently deploy assets throughout the day and week. We will start by looking at monthly demand.
To obtain a dataset of monthly demand we will have to group the dataset by year and month and then count the number of trips in each group. Once we obtain this monthly dataset, we can view a line graph of the demand over time.
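A minimal sketch of this grouping step, written in plain Python rather than the R used for this report. The function name `monthly_trip_counts` is hypothetical, and the timestamp format is assumed to match the `Start.Date` column shown in the table above.

```python
from collections import Counter
from datetime import datetime


def monthly_trip_counts(start_times):
    """Count trips per (year, month) from 'YYYY-MM-DD HH:MM:SS' strings."""
    counts = Counter()
    for stamp in start_times:
        started = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        counts[(started.year, started.month)] += 1
    # Sort chronologically so the result can be plotted as a time series.
    return sorted(counts.items())


# Two start times from the table above plus one made-up October ride.
sample = [
    "2010-09-20 11:27:04",
    "2010-09-20 11:41:22",
    "2010-10-01 08:15:00",
]
print(monthly_trip_counts(sample))
```

Each element of the result pairs a `(year, month)` key with the number of rides that started in that month.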
There are a few features of this time series that I would like to point out. We can see an upward trend in the time series. At the beginning, there was rapid growth that has given way to slower, steady growth. We can also see that there is significant annual seasonality with peaks during the summer and troughs during the winter. Furthermore, the variability of the seasonal trend has increased as the overall demand has increased.
This graph does not tell us the full story. As I mentioned earlier, one of the variables in the original dataset is member type. This variable can take on two values: member and casual. Members are the customers with the annual or 30-day membership; casual users are the customers with the 3-day pass, 24-hour pass, or those making a single trip. Below I have made a plot of the time series separated by user type.
We can immediately see that the majority of the demand is generated by members. And while there is growth in both user types, most of the growth in demand has been driven by members. We can also see that member demand continued to grow in 2018, but casual demand actually decreased noticeably that year.
When a time series has a seasonal pattern, it is also sometimes useful to look at the data in seasonal and month plots. These graphs can be seen below.
The seasonal plot shows us a separate line for each year with a data point for each month. The month plot shows us the opposite: a line for each month with a data point for each year.
These plots show us that demand usually reaches its peak sometime in June, July, or August. We can also see more clearly how demand increased rapidly at first and has since leveled off. The plots also show us that overall demand is down in 2018 in nearly every month. As I mentioned before, though, this is largely due to the decrease in casual demand.
Now that we have a better understanding of the features of the monthly time series, we can begin using this information to build forecasting models. There are many types of forecasting models. In this report I skip over the simpler models, as they are inadequate for this data, and go straight to discussing the ARIMA model.
I do not want to go into too much technical detail, but I will briefly discuss how forecasting models work, specifically the ARIMA model. In general, forecasting models are able to predict future values by picking up on historical trends and patterns. The ARIMA (autoregressive integrated moving average) model is one of the most commonly used forecasting models. It is quite robust and can handle a variety of situations. ARIMA works by first removing trend and seasonality through a process called “differencing”; the model then takes advantage of relationships between past and present values to predict future values. We can also add external variables to improve our prediction; this kind of model is called an ARIMAX model. In our case, external variables might be temperature or precipitation.
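To make “differencing” concrete, here is a small Python sketch (not the R code behind this report). The helper `difference` and the toy series are made up for illustration: the series is a straight-line trend plus a pattern that repeats every 12 months.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical monthly series: linear trend plus a repeating 12-month pattern.
seasonal_pattern = [0, 1, 3, 6, 9, 12, 13, 12, 9, 6, 3, 1]
demand = [10 * t + seasonal_pattern[t % 12] for t in range(36)]

# Seasonal differencing (lag 12) removes the repeating annual pattern.
deseasonalized = difference(demand, lag=12)
# First differencing (lag 1) removes what is left of the trend.
stationary = difference(deseasonalized, lag=1)

print(deseasonalized[:3], stationary[:3])
```

On this idealized series the seasonal difference is constant and the second pass leaves only zeros. Real demand data would leave noisy residuals, which are what the autoregressive and moving average terms then try to model.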
Before I display the results of the models I have built, I should make note of two things. First, there are several parameters for an ARIMA model that must be manually chosen. So, behind the scenes I have fit several different models and I display the results of the best model. Second, I have not used all of the data to train the model. I initially used 2011-2017 to train, and tested on 2018. However, models built with this training data all overestimated the demand in 2018. This is because the first few years had higher rates of growth than recent years. Forecasting models do not do well when the trends and patterns in the data change over time. For that reason, it is sometimes better to train forecasting models using a subset of the data. In this case, I retrained the models using 2013-2017 data and achieved much better results. Now let us get to the results.
## Series: month.train
## ARIMA(2,1,0)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 sma1
## -0.5561 -0.5359 -0.7625
## s.e. 0.1276 0.1235 0.4689
##
## sigma^2 estimated as 851.4: log likelihood=-228.88
## AIC=465.77 AICc=466.72 BIC=473.17
For the sake of thoroughness, I have output the actual model above, but I won’t discuss it here (Sean and Leah, if you are interested in knowing more about what this output means, I can explain more when we get a chance to talk on the phone). Instead, let’s go straight to examining the accuracy of the model. Below is a graph showing the actual and forecasted monthly demand in 2018.
The graph shows that we have done quite well. There are several different metrics that are used to quantify the accuracy of forecasting models. One of the most common is mean absolute percentage error (MAPE). As with most accuracy metrics, the lower the better. In this case, we achieved a MAPE of 9.27%. This can be interpreted to mean that, on average, the forecast is off by 9.27%. Considering the amount of variability in the data, and the fact that we are forecasting a whole year into the future, this is quite good.
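For reference, MAPE is the average of the absolute errors expressed as a percentage of the actual values. The Python sketch below uses made-up numbers, not the report's data, and the function name `mape` is only illustrative.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero, so this sketch assumes
    strictly positive demand.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)


# Hypothetical actual vs. forecast demand for three periods.
actual = [100, 200, 400]
forecast = [110, 180, 400]
print(round(mape(actual, forecast), 2))
```

Here the forecasts miss by 10%, 10%, and 0%, so the MAPE is their average, about 6.67%.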
I have only forecasted a year into the future because I only had a year’s worth of testing data to compare against. In theory, though, we can forecast out as far as we want. Below is a graph showing a two-year forecast with 80% and 95% confidence intervals.
I would like to draw attention to the fact that the confidence intervals, which help to measure our uncertainty, widen the farther out the forecast goes. This will happen with any forecasting model, and the reason is fairly intuitive. The farther out in time we go, the more likely it becomes that the trend of the data will change.
As I mentioned earlier, monthly forecasts are helpful in guiding long-term strategy. In this case, our forecasts show that growth in demand is beginning to level out. This could mean several different things depending on what actions Capital Bikeshare has been taking. If they have been continually increasing their fleet and number of stations, it could indicate that the current market has become saturated and they need to look to new markets for continued growth. If they have not been increasing their fleet size and number of stations, then this may indicate to them that they need to increase both of these to meet demand.
When I talked to Sean on the phone, he mentioned that one of the things GOTCHA needs to get from its data is information about how to more efficiently deploy its products. For this purpose, daily and hourly forecasts would be much more useful.
To start off with daily forecasting, let us look at a graph of daily demand over time.
We can see from the above plot that daily demand follows the same general trend as monthly demand, but with much more variability. There is one odd feature of the time series: there is a sharp peak each year before the normal peak summer months. Let’s look at this same plot split by member type to see if we can learn any more.
There is a lot of overlap between the two time series, but we can see that the sharp peak is being caused by casual user demand. This indicates to us that this sharp peak is probably being caused by tourist activity. After some more sleuthing, I figured out that the peak comes around mid-March every year, which lines up with both the National Cherry Blossom Festival and Easter. So, in addition to accounting for variables like temperature and precipitation, our models for daily demand should also try to incorporate information about holidays and events.
Before we proceed to modeling, however, let us look at some more graphs to find any other patterns in the data. In addition to the annual seasonality we had for the monthly dataset, we likely have an added layer of weekly seasonality.
From the side-by-side boxplots shown above, we can see that there is indeed an added layer of weekly seasonality, although the weekly pattern differs between casual users and members. The members are most likely DC residents who use the bike share to commute to work every day; they have higher demand on work days compared to weekends. On the other hand, casual user demand is higher on the weekends.
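The weekday comparison behind those boxplots can be sketched in Python as a group-by on member type and day of week. The daily counts below are invented for illustration, and `mean_demand_by_weekday` is a hypothetical helper, not part of the original analysis.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_demand_by_weekday(daily_counts):
    """daily_counts: iterable of (day, member_type, trips) tuples."""
    groups = defaultdict(list)
    for day, member_type, trips in daily_counts:
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        groups[(member_type, WEEKDAYS[day.weekday()])].append(trips)
    return {key: mean(values) for key, values in groups.items()}


# Made-up daily totals for two Mondays and one Saturday.
daily = [
    (date(2018, 1, 1), "Member", 100),
    (date(2018, 1, 8), "Member", 120),
    (date(2018, 1, 6), "Casual", 50),
]
print(mean_demand_by_weekday(daily))
```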
When we have two groups with distinct patterns like this, it is often more accurate to model the groups with entirely separate forecasting models. For the sake of finishing this report in a timely manner, however, I have decided to forgo a separate model approach. And as you are about to see, I am still able to achieve accurate predictions even when modeling casual user and member demand together.
As I have previously mentioned, there is a large amount of day-to-day variability in this data. I believe that much of this variability can be captured by weather data. So, I have gone ahead and obtained daily temperature and precipitation data from Weather Underground. Below are scatterplots comparing temperature and precipitation against daily demand.
We can see there is a clear relationship between temperature and demand. The correlation between the two variables is 0.77. This tells us the relationship is moderately strong and positive; that is, as temperature increases, demand increases. However, correlation only measures the strength of linear relationships and this data appears to have a quadratic or cubic relationship, so the relationship may be even stronger than the correlation metric indicates. What I mean by a quadratic relationship is that overall demand increases as temperature increases, but demand does appear to begin decreasing slightly as we reach extreme temperatures in the high 80s and 90s.
Precipitation is less strongly correlated with demand, with a correlation of -0.25. That does not mean, however, that precipitation cannot be useful when trying to forecast daily demand.
In the above graphs, I displayed daily demand’s relationships to temperature and precipitation separately, but we can create one regression model with all of the variables in it. The output for such a model is shown below.
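The point that correlation only captures linear relationships can be illustrated with a short Python sketch. The demand values below are synthetic: they are generated as an exact function of temperature that rises and then dips at high temperatures, yet the Pearson correlation still comes out below 1.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Synthetic data: demand is fully determined by temperature,
# peaking at 80 degrees and falling off on either side.
temps = list(range(30, 100, 5))
demand = [5000 - (t - 80) ** 2 for t in temps]

r = pearson(temps, demand)
print(round(r, 2))
```

Because the relationship is curved, `r` understates how well temperature predicts demand here, which is the same caveat that applies to the 0.77 figure above.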
##
## Call:
## lm(formula = Count ~ Avg + Avg2 + Avg3 + Precip, data = merged.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8563.9 -1030.9 64.9 1197.5 7580.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.223e+03 1.208e+03 5.978 2.93e-09 ***
## Avg -4.503e+02 7.317e+01 -6.155 1.01e-09 ***
## Avg2 1.432e+01 1.400e+00 10.222 < 2e-16 ***
## Avg3 -9.760e-02 8.471e-03 -11.522 < 2e-16 ***
## Precip -3.993e+03 1.767e+02 -22.601 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1829 on 1268 degrees of freedom
## Multiple R-squared: 0.7465, Adjusted R-squared: 0.7457
## F-statistic: 933.6 on 4 and 1268 DF, p-value: < 2.2e-16
The only thing I would like to draw your attention to here is the adjusted \(R^2\) value at the bottom of the output. This model has an adjusted \(R^2\) of 0.7457. This can be interpreted to mean that temperature and precipitation can jointly explain 74.57% of the variation in the daily demand. We will have to keep this in mind when building our forecasting models.
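As a rough check on the output above, the Python sketch below plugs the printed coefficients into the fitted cubic and recomputes the adjusted \(R^2\). It assumes that `Avg` is average daily temperature in degrees Fahrenheit and that `Avg2` and `Avg3` are its square and cube. The coefficients are rounded as printed, so the results are approximate.

```python
from math import sqrt

# Coefficients copied (rounded as printed) from the regression output.
INTERCEPT, B1, B2, B3, B_PRECIP = 7223.0, -450.3, 14.32, -0.0976, -3993.0


def predicted_demand(temp, precip=0.0):
    """Fitted daily trips for a given temperature and precipitation."""
    return (INTERCEPT + B1 * temp + B2 * temp ** 2
            + B3 * temp ** 3 + B_PRECIP * precip)


def peak_temperature():
    """Temperature where the fitted cubic reaches its local maximum."""
    # Set the derivative 3*B3*T^2 + 2*B2*T + B1 equal to zero.
    a, b, c = 3 * B3, 2 * B2, B1
    disc = sqrt(b * b - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    # The maximum is the root where the second derivative is negative.
    return next(r for r in roots if 2 * a * r + b < 0)


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


print(round(peak_temperature(), 1))
# 1268 residual degrees of freedom + 5 estimated coefficients = 1273 rows.
print(round(adjusted_r2(0.7465, 1273, 4), 4))
```

Under these assumptions the fitted curve turns over in the high 70s, which is broadly consistent with the dip in demand at the hottest temperatures, and the recomputed adjusted \(R^2\) agrees with the 0.7457 in the output.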
We will start off building an ARIMA model to see how well we can do without weather data. And if necessary, we will build an ARIMAX model that does include weather data.
I have gone ahead and fit an ARIMA model by training on data from 2016 and 2017. Below is a graph displaying the model’s predictions against the actual values for the first two months of 2018.
We can see that the model was able to pick up the overall trend and weekly seasonality, but it does a very poor job at capturing the day-to-day variation caused by weather. The MAPE of this model is 27.8%. Let us see how much better we can do by adding weather variables to our model.
As we can see from the line graphs, the ARIMAX model that includes weather data does a much better job at capturing the day-to-day variability in demand. The ARIMAX model achieved a MAPE of 17.0%, compared to 27.8% for the ARIMA model. I should note that in these models I have forecasted using exact temperature and precipitation data. When forecasting in real life, we would not already know the exact weather conditions on a given day in the future. That being said, we can easily adapt this approach to use weather forecasts rather than exact weather data.
I mentioned earlier that holidays and events can also significantly impact daily demand. In these models I have not incorporated variables related to events since I was unable to easily find such a dataset, but if we were using these models to make real-life business decisions it would be well worth the time to find and incorporate event and holiday data.
For the sake of time I have not built any hourly demand forecasting models, but I will briefly discuss how I would go about doing it. For the most part, it would be very similar to how we modeled daily data. However, there would now be three layers of seasonality: annual, weekly, and daily. Given that any hourly forecasts would be done on a very short time horizon (a couple of days at most), we would likely be able to ignore annual seasonality. Daily seasonality and weather data would likely be the most important features of the model, so let us look at how demand changes from hour to hour.
We can immediately see that members and casual users once again have different patterns, so it would be worth modeling them with separate models. For the members, we can see that there are two peaks in the morning and evening. These peaks likely represent morning and evening rush hour demand. The casual user demand is relatively stable throughout the day with a slight peak around midday. These two patterns reinforce our belief that members are mostly residents who use the bikeshare for commuting and the casual users are tourists who use the bikeshare to ride around the city.
I would like to briefly talk about what might be different between the forecasting models I have built here and forecasting models for GOTCHA data. Capital Bikeshare operates one bikeshare in a very large, connected ecosystem, whereas GOTCHA operates across a variety of campuses and towns. There are two possible approaches to modeling data across several locations. The quick and easy approach would be to model the aggregate demand across all locations, then use historical proportions to estimate the demand in each location. The more involved, but more accurate, approach would be to build separate models for each location. The second approach is definitely the better one, as different campuses would have very different school schedules, events, and weather on any given day.
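The quick approach amounts to splitting one aggregate forecast by each location's historical share of trips. A minimal Python sketch, with made-up location names and numbers and a hypothetical helper `allocate_forecast`:

```python
def allocate_forecast(total_forecast, historical_counts):
    """Split an aggregate forecast by each location's historical share."""
    total = sum(historical_counts.values())
    return {
        location: total_forecast * count / total
        for location, count in historical_counts.items()
    }


# Hypothetical historical trip counts per location.
history = {"Campus A": 600, "Campus B": 300, "Town C": 100}
split = allocate_forecast(2000, history)
print(split)
```

This keeps the shares fixed, which is exactly why it breaks down when locations differ in schedules, events, or weather on a given day.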
Let us quickly see how Capital Bikeshare has expanded and managed their fleet of bikes over time. Below is a graph of the fleet size.
As of the beginning of 2018, Capital Bikeshare had added 4818 bikes to its fleet. Over the same time period, 285 of those bikes were removed from service, leaving the fleet size at 4533. Before we move on, let us quickly look at some numerical summaries for the distribution of bike lifespans and compare between active and retired bikes.
| Lifespan (days) | Min | Q1 | Med | Q3 | Max | Mean |
|---|---|---|---|---|---|---|
| Retired | 28 | 556.2 | 1218 | 1796.8 | 2649 | 1248 |
| Active | 243 | 1126.0 | 1979 | 2557.0 | 3013 | 1879 |
From the above table, we can see that retired bikes have a median and mean lifespan of 1218 and 1248 days, respectively. Active bikes have a median and mean lifespan of 1979 and 1879 days, respectively; since these bikes are still in service, those figures reflect their age so far rather than a completed lifespan. Unfortunately, without more information I am not able to say why any given bike was removed from the fleet.
I should also note that my estimates of fleet size may be slightly off at different time periods. To determine when a bike entered and left the fleet, I simply looked for the first and last date each bike appeared in the dataset I have. So, it is possible that a bike sat idle for some time after being added to the fleet or before being removed from it. This would only bias my estimates slightly, and overall these figures should be roughly accurate.
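The first-and-last-appearance approach can be sketched in Python as follows. The rides are invented (the bike numbers merely follow the `Bike.Number` format), and both helper names are hypothetical.

```python
from datetime import date


def bike_service_windows(rides):
    """rides: iterable of (bike_number, ride_date).

    Returns {bike_number: (first_seen, last_seen)}.
    """
    windows = {}
    for bike, day in rides:
        first, last = windows.get(bike, (day, day))
        windows[bike] = (min(first, day), max(last, day))
    return windows


def fleet_size_on(windows, day):
    """Number of bikes whose observed service window covers `day`."""
    return sum(first <= day <= last for first, last in windows.values())


# Made-up rides for two bikes.
rides = [
    ("W00742", date(2010, 9, 20)),
    ("W00742", date(2012, 3, 1)),
    ("W00032", date(2011, 1, 5)),
    ("W00032", date(2011, 6, 30)),
]
windows = bike_service_windows(rides)
first, last = windows["W00032"]
print((last - first).days, fleet_size_on(windows, date(2011, 3, 1)))
```

A bike only counts as in the fleet between its first and last observed ride, which is the source of the small bias described above.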
We may also be interested in tracking the number of stations over time. We can use the station and date variables to identify the first time each station is used. Then we can group by year and month to get the number of new stations each month. Below is a graph of the number of Capital Bikeshare stations over time.
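A Python sketch of that station count, under the assumption that a station's first recorded ride marks its opening. The rides are made up and `cumulative_stations_by_month` is a hypothetical name.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def cumulative_stations_by_month(rides):
    """rides: iterable of (station, ride_date).

    Returns [((year, month), running_total_of_stations), ...].
    """
    first_seen = {}
    for station, day in rides:
        if station not in first_seen or day < first_seen[station]:
            first_seen[station] = day
    # Count how many stations appear for the first time in each month.
    new_per_month = Counter((d.year, d.month) for d in first_seen.values())
    months = sorted(new_per_month)
    totals = accumulate(new_per_month[m] for m in months)
    return list(zip(months, totals))


# Made-up rides touching three stations.
station_rides = [
    ("5th & K St NW", date(2010, 9, 20)),
    ("4th & M St SW", date(2010, 9, 20)),
    ("5th & K St NW", date(2010, 10, 2)),
    ("Lincoln Memorial", date(2011, 4, 1)),
]
print(cumulative_stations_by_month(station_rides))
```

Months in which no new station appears are simply absent from the result; a plot would carry the previous total forward across them.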
From the graph we can see that the number of stations added over time roughly matches the growth in ride demand over time. I should note, though, that while earlier graphs seemed to show that growth in demand was slowing, the growth in the number of stations has remained relatively constant. This may indicate that there is a decreasing return from adding new stations after a certain point. That is not to say that it is not worthwhile to continue adding stations. It may be increasing at a lower rate, but demand for the bikes is still growing.
All of what I have just said does not consider the fact that Capital Bikeshare operates throughout the entire DC Metropolitan area, not just the city itself. So, it is possible that the city has already been saturated with stations and bikes, but there may still be room to grow in the Virginia and Maryland suburbs. Let us view the number of stations over time by area.
From this graph, we can see that Capital Bikeshare has been steadily expanding its presence in DC and Arlington County (Virginia) since it first launched. In mid-2012, they expanded to Alexandria and have added only a handful of stations there over time. From my own knowledge of the area, this makes sense as Alexandria is a relatively small independent city, not a large county. In mid-2013, they rapidly expanded into Montgomery County (Maryland) and have slowly expanded their presence there ever since. Over time, they have expanded into several other areas in Virginia and Maryland. In particular, they have moved into Fairfax County (Tysons, VA; Reston, VA). Both Tysons and Reston are well-established areas, but they have been growing quickly in recent years and may be a great opportunity for Capital Bikeshare to expand their user base.
We now have an idea of what regions the bikeshare operates in, but we have not yet explored what areas have the highest traffic and what direction traffic is moving in throughout the day. Before we move on, though, let’s first just take a quick look at a map of station locations so we can better understand where Capital Bikeshare has a presence.
The dataset I have lists only the street address of each station. To plot points on a map, however, we will need lat-lon coordinates. This shouldn’t be a problem, though. All we need is a Google API key, then we can use the ggmap package to obtain coordinates.
From this map we can get a better idea of how traffic might move around between areas. Tysons and Reston (Fairfax County) likely do not have trips traveling outside of those areas due to the great distance. However, there is probably a good amount of cross traffic between DC and the nearby suburbs of Arlington and southern Montgomery County. In the next section, we will try to quantify and validate the hypotheses that I have just put forth by examining popular areas and the interactions between different areas.
First let us look at the stations that are most used in terms of both pickups and dropoffs. Below are tables of the ten most popular starting stations and ending stations.
| Start.Station | Med.Count | Mean.Count |
|---|---|---|
| Columbus Circle / Union Station | 147 | 159.49263 |
| Massachusetts Ave & Dupont Circle NW | 147 | 143.06393 |
| Lincoln Memorial | 136 | 156.77807 |
| 15th & P St NW | 115 | 109.67606 |
| Jefferson Dr & 14th St SW | 114 | 131.46802 |
| Henry Bacon Dr & Lincoln Memorial Circle NW | 99 | 111.15086 |
| Thomas Circle | 98 | 99.42659 |
| 4th St & Madison Dr NW | 94 | 105.38428 |
| New Hampshire Ave & T St NW | 89 | 87.75701 |
| Eastern Market Metro / Pennsylvania Ave & 7th St SE | 87 | 84.42125 |
| End.Station | Med.Count | Mean.Count |
|---|---|---|
| Massachusetts Ave & Dupont Circle NW | 162.0 | 160.10500 |
| Columbus Circle / Union Station | 154.5 | 165.13003 |
| Lincoln Memorial | 135.0 | 156.37340 |
| 15th & P St NW | 126.0 | 120.09142 |
| Jefferson Dr & 14th St SW | 115.0 | 135.56865 |
| Henry Bacon Dr & Lincoln Memorial Circle NW | 96.0 | 110.97414 |
| 14th & V St NW | 94.0 | 92.77748 |
| 4th St & Madison Dr NW | 94.0 | 105.52151 |
| Thomas Circle | 93.0 | 94.24894 |
| New Hampshire Ave & T St NW | 89.0 | 87.10080 |
Unsurprisingly, all of the most popular stations are in DC. While knowing which stations are popular can help inform us about how big a bike rack we need at each station, when considering how to efficiently deploy the fleet it may be more useful to consider net number of rides per station. That is, we want to figure out which stations have more bikes leaving than arriving and vice versa.
| Station | Mean.Net.Trips |
|---|---|
| 14th & Irving St NW | 33.38356 |
| 16th & Harvard St NW | 33.18630 |
| 14th & Harvard St NW | 13.13699 |
| Columbia Rd & Belmont St NW | 12.26027 |
| 39th & Calvert St NW / Stoddert | 10.86027 |
| Columbia & Ontario Rd NW | 10.81918 |
| 15th & Euclid St NW | 10.55890 |
| 11th & Kenyon St NW | 10.21644 |
| 14th & Girard St NW | 10.21644 |
| 16th & Irving St NW | 10.12329 |
| Station | Mean.Net.Trips |
|---|---|
| 17th & K St NW / Farragut Square | -24.44110 |
| 13th St & New York Ave NW | -24.05479 |
| Georgetown Harbor / 30th St NW | -20.27747 |
| Lynn & 19th St North | -13.92033 |
| Massachusetts Ave & Dupont Circle NW | -13.01918 |
| Columbus Circle / Union Station | -12.62192 |
| 31st & Water St NW | -12.51648 |
| 8th & H St NW | -12.46027 |
| 34th & Water St NW | -12.17534 |
| M St & Pennsylvania Ave NW | -12.00000 |
The above tables show us the ten stations with the highest positive net trips and the ten stations with the most negative net trips. A positive number of net trips indicates that there are more trips beginning than ending at the station. This means Capital Bikeshare would need employees to bring bikes to these stations in order to meet demand. Conversely, stations with negative net trips would have bikes piling up at the station and employees would need to move bikes away from these stations. We could combine this information to match net positive and net negative stations that are near each other so that employees can redistribute bikes in the most efficient way possible.
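The net trip calculation itself is simple: each trip adds one to its start station and subtracts one from its end station. A Python sketch with invented trips between a few of the stations named above (`net_trips` is a hypothetical helper):

```python
from collections import Counter


def net_trips(trips):
    """trips: iterable of (start_station, end_station).

    Returns {station: trips started there minus trips ended there}.
    """
    net = Counter()
    for start, end in trips:
        net[start] += 1
        net[end] -= 1
    return dict(net)


# Made-up trips; real input would be one pair per ride in the dataset.
trips = [
    ("14th & Irving St NW", "8th & H St NW"),
    ("14th & Irving St NW", "8th & H St NW"),
    ("14th & Irving St NW", "Thomas Circle"),
    ("8th & H St NW", "14th & Irving St NW"),
]
print(net_trips(trips))
```

This gives totals over whatever period the input covers. Dividing by the number of days in that period would give a daily average comparable to the `Mean.Net.Trips` column, assuming that column is a per-day mean.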
So far we have looked at metrics for individual stations. When we looked at the map of all stations, I briefly hypothesized about how traffic might move between the major areas where Capital Bikeshare has a presence. Let us try to visualize and quantify traffic patterns between these areas. For now we will focus on DC, Arlington, and Alexandria.
| Start.Area | End.Area | Count | Percent |
|---|---|---|---|
| Alexandria | Alexandria | 15279 | 2.28 |
| Alexandria | Arlington | 1706 | 0.25 |
| Alexandria | Other | 11 | 0.00 |
| Alexandria | Washington, DC | 729 | 0.11 |
| Arlington | Alexandria | 676 | 0.10 |
| Arlington | Arlington | 40925 | 6.11 |
| Arlington | Other | 6 | 0.00 |
| Arlington | Washington, DC | 18696 | 2.79 |
| Washington, DC | Alexandria | 108 | 0.02 |
| Washington, DC | Arlington | 6913 | 1.03 |
| Washington, DC | Other | 1431 | 0.21 |
| Washington, DC | Washington, DC | 583267 | 87.09 |
| Start.Area | End.Area | Count | Percent |
|---|---|---|---|
| Alexandria | Alexandria | 19593 | 2.09 |
| Alexandria | Arlington | 1373 | 0.15 |
| Alexandria | Other | 57 | 0.01 |
| Alexandria | Washington, DC | 373 | 0.04 |
| Arlington | Alexandria | 3130 | 0.33 |
| Arlington | Arlington | 55939 | 5.97 |
| Arlington | Other | 29 | 0.00 |
| Arlington | Washington, DC | 13698 | 1.46 |
| Washington, DC | Alexandria | 1092 | 0.12 |
| Washington, DC | Arlington | 18211 | 1.94 |
| Washington, DC | Other | 4610 | 0.49 |
| Washington, DC | Washington, DC | 819110 | 87.40 |
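Tables with this layout can be produced by counting trips for each pair of start and end areas and dividing by the total. The Python sketch below uses made-up trips and assumes each station has already been mapped to an area; `area_flow_table` is a hypothetical name.

```python
from collections import Counter


def area_flow_table(trips):
    """trips: iterable of (start_area, end_area).

    Returns rows of (start_area, end_area, count, percent_of_all_trips).
    """
    counts = Counter(trips)
    total = sum(counts.values())
    return [
        (start, end, n, round(100 * n / total, 2))
        for (start, end), n in sorted(counts.items())
    ]


# Made-up trips between two areas.
area_trips = [
    ("Washington, DC", "Washington, DC"),
    ("Washington, DC", "Washington, DC"),
    ("Washington, DC", "Washington, DC"),
    ("Arlington", "Washington, DC"),
]
for row in area_flow_table(area_trips):
    print(row)
```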